(57) ABSTRACT

The presence of a voice in an audio signal is detected by 
sampling frequency components of the audio signal during 
a window that starts when a power of the audio signal 
reaches a predetermined threshold and stops when the audio 
signal's power drops below the predetermined threshold. An 
array of elements is generated based on the sampled
frequency components. Each element in the array corresponds
to a time-based sum of frequency components. Whether the
audio signal corresponds to a voice is determined using one
or more values calculated from the generated array. Each value
may correspond either to a frequency-based sum of array
elements or to the window. The calculated values are analyzed
using fuzzy logic which generates a measure of a likelihood 
that the audio signal is a voice. 
VOICE DETECTION IN AUDIO SIGNALS

BACKGROUND

This invention relates to identifying a presence of a voice in audio signals, for example, in a telephone network.

An audio signal can be any electronic transmission that conveys audio information. In a telephone network, audio signals include tones (for example, dual tone multifrequency (DTMF) tones, dial tones, or busy signals), noise, silence, or speech signals. Voice detection differentiates a speech signal from tones, noise, or silence.

One use for voice detection is in automated calling systems used for telemarketing. In the past, for example, a company trying to sell goods or services typically used several different telemarketing operators. Each operator would call a number and wait for an answer before taking further action such as speaking to the person on the line or hanging up and calling another prospective buyer. In recent years, however, telemarketing has become more efficient because telemarketers now use automatic calling machines that can call many numbers at a time and notify the telemarketer when someone has picked up the receiver and answered the call. To perform this function, the automatic calling machines must detect a presence of human speech on the receiver amid other audio signals before notifying the telemarketer. The detection of human speech in audio signals can be achieved using digital signal processing techniques.

FIG. 1 is a block diagram of a voice detector 10 that detects a presence of a voice in an audio signal. A time varying input signal 12 is received and a coder/decoder (CODEC) 14 may be used for analog-to-digital (A/D) conversion if the input signal is an analog signal; that is, a signal continuous in time. During A/D conversion, the CODEC 14 periodically samples in time the analog signal and outputs a digital signal 16 that includes a sequence of the discrete samples. The CODEC 14 optionally may perform other coding/decoding functions (for example, compression/decompression). If, however, the input signal 12 is digital, then no A/D conversion is needed and the CODEC 14 may be bypassed.

In either case, the digital signal 16 is provided to a digital signal processor (DSP) 18 which extracts information from the signal using frequency domain techniques such as Fourier analysis. Such frequency-domain representation of audio signals greatly facilitates analysis of the signal. A memory section 20 coupled to the DSP 18 is used by the DSP for storing and retrieving data and instructions while analyzing the digital audio signal 16.

FIG. 2A shows an example of a human speech audio signal 22 represented as an analog signal that may be input into the voice detector 10 of FIG. 1. Furthermore, FIG. 2B shows a digital signal 24 that corresponds to the input analog signal after it has been processed by the CODEC 14. In FIG. 2B, the analog signal of FIG. 2A has been sampled at a period T 26. Voiced sounds, such as those illustrated in region 28 of FIGS. 2A and 2B, generally result in a vibration of the human vocal tract and cause an oscillation in the audio signal. In contrast, unvoiced speech sounds, such as those illustrated in region 30 of FIGS. 2A and 2B, generally result in a broad, turbulent (that is, non-oscillatory), and low amplitude signal. The frequency domain representation of the human speech signal of FIG. 2B, for example, displays both voiced and unvoiced characteristics of human speech that may be used in the voice detector 10 to distinguish the speech signal from other audio signals such as tones, noise, or silence.

FIG. 3 is a flow chart of operation of the voice detector of FIG. 1. The voice detector 10 initially determines if the incoming audio signal 12 is digital in format (step 32). If the audio signal is digital, then the voice detector performs a discrete Fourier transform (DFT) on the digitized signal (step 36). If, however, the audio signal is analog in format, then the CODEC 14 samples the audio signal at a specified period to obtain a digital representation 16 of the audio signal (step 34). Then the voice detector 10 performs a DFT at step 36.

Parameters such as frequency-domain maxima are extracted from the signal (step 38) and are compared to predetermined thresholds (step 40). If the parameters exceed the thresholds, the voice detector 10 determines that the audio signal corresponds to a human voice, in which case the voice detector 10 reports the presence of the voice in the audio signal.

In step 38, the parameters extracted from the audio signal, such as the frequency-domain maxima, may, for example, correspond to formant frequencies in speech signals. Formants are natural frequencies or resonances of the human vocal tract that occur because of the tubular shape of the tract. There are main resonances (formants) of significance in human speech, the locations of which are identified by the voice detector 10 and used in the voice detection analysis. Other parameters may be extracted and used by the voice detector 10.

Voice detection analysis is complicated by the fact that formant frequencies are sometimes difficult to identify for low-level voiced sounds. Moreover, defining the formants for unvoiced regions (for example, region 30 in FIGS. 2A and 2B) is impossible.

SUMMARY

Implementations of the invention may include various combinations of the following features.

In one general aspect, a method of detecting a presence of a voice in an audio signal comprises sampling frequency components of the audio signal during a window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the threshold. The method also comprises generating an array of elements based on the sampled frequency components, each element of the array corresponding to a time-based sum of frequency components. The method makes a voice detection determination based on one or more values calculated from the generated array. Each value corresponds either to a frequency-based sum of elements or to the window.

Embodiments may include one or more of the following features.

A value corresponding to a frequency-based sum of array elements may be a ratio of a frequency-based sum of array elements in a lower frequency range and a frequency-based sum of array elements in a higher frequency range. A value corresponding to a frequency-based sum of array elements may be a ratio of a maximum-value array element in a lower frequency range and a frequency-based sum of array elements in the lower frequency range other than the maximum-value element.

Prior to sampling, the power of the audio signal may be estimated.

The determining may comprise analyzing the calculated values using fuzzy logic, in which analyzing comprises generating a degree of membership in a fuzzy set for each
value. The degree of membership, which may be based on
a statistical analysis of audio signals, may represent a
measure of a likelihood that the audio signal is a voice. The
analyzing may comprise combining degrees of membership
for each value into a final value and converting the final
value into a voice detection decision. The final value may be
converted into a decision by comparing the final value to a
predetermined threshold. 
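
The fuzzy analysis outlined above (a degree of membership per value, a combined final value, and a threshold comparison) can be sketched in a few lines. This is a minimal illustration, not the detector's actual implementation: the piecewise-linear `ramp` membership shape and all function names are assumptions chosen for the example, whereas the text says real membership functions come from a statistical analysis of audio signals.

```python
from typing import Callable, Optional, Sequence


def ramp(lo: float, hi: float) -> Callable[[float], float]:
    """Hypothetical membership function: 0 at or below lo, 1 at or above hi, linear between."""
    def membership(x: float) -> float:
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        return (x - lo) / (hi - lo)
    return membership


def combine(degrees: Sequence[float],
            weights: Optional[Sequence[float]] = None) -> float:
    """Combine degrees of membership into one final value (plain or weighted mean)."""
    if weights is None:
        weights = [1.0] * len(degrees)
    return sum(w * d for w, d in zip(weights, degrees)) / sum(weights)


def decide(final_value: float, threshold: float) -> int:
    """Convert the final value into a decision: 1 (voice) only if it exceeds the threshold."""
    return 1 if final_value > threshold else 0
```

For instance, `decide(combine([0.93, 0.99, 0.87]), 0.97)` evaluates to 0 because the mean of the three illustrative degrees is 0.93, which does not exceed 0.97. Passing `weights` lets some values count more than others.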

The audio signals may occur on a telephone line.
Likewise, the audio signals may occur in a computer
telephony line.

The methods, techniques, and systems described here may 
provide one or more of the following advantages. The voice 
detector is implemented using digital signal processing 
(DSP) and fuzzy analysis techniques to determine the
presence of a voice in an audio signal. The voice detector
provides higher reliability and greater simplicity since
features are extracted from the averaged spectrum of the
incoming signal and fuzzy (as opposed to boolean) logic is
employed in the voice detection decision. Furthermore, the
voice detector is adaptable since fuzzy logic parameters may
be adjusted for different telephone calling locations or lines.
This adaptability, in turn, contributes to higher voice
detection reliability.

Other advantages and features will become apparent from 
the detailed description, drawings, and claims. 

DRAWING DESCRIPTIONS 

FIG. 1 is a block diagram of a detector that can be used
for detection of a voice.

FIGS. 2A and 2B are graphs of a speech signal 
represented, respectively, as an analog signal and as a 
sequence of samples. 

FIG. 3 is a flowchart of voice detection of FIG. 1 that uses
frequency-domain parameter extraction.

FIG. 4 is a block diagram showing elements of a voice 
detection analysis technique based on several averaged 
frequency-domain features. 

FIG. 5 is a graph of a generalized fuzzy membership 
function. 

FIG. 6 is a flowchart illustrating the voice detection of 
FIG. 4. 

DETAILED DESCRIPTION

Certain applications in telecommunications require reliable
detection of speech sounds amid tones such as call-progression
tones or dual tone multifrequency (DTMF) tones, noise, and
silence. In general, voice detectors that recognize speech
based on frequency-domain maxima are relatively unreliable
because only a few frequency-domain maxima are used and
complete spectrum information of a "word" is ignored.
(A "word" is any audio signal with energy, that is, an
amplitude of the frequency spectrum, large enough to trigger
voice detection analysis.) In contrast, a voice detector that
utilizes several average values from a substantially complete
frequency-domain audio spectrum and fuzzy logic techniques
provides simpler implementation, greater flexibility, and
higher reliability.

FIG. 4 shows a block diagram of such a voice detector 50 that uses several frequency-domain averaged features and further employs fuzzy logic for making the voice detection decision. A digital audio signal x(n) (block 16) serves as an input for the voice detector 50, where n is an index of time. Periodically, a power estimator 52 estimates the power of the incoming signal sample x(n). Power estimation may occur every 10 ms, a length of time much shorter than the duration of a spoken word in human speech. A word boundary detector 54 compares the power of the incoming signal 16 to a predetermined word threshold (WORD_THRESHOLD). If the audio signal's power exceeds WORD_THRESHOLD, then the digital signal 16 is provided to a block 56 which performs a fast Fourier transform (FFT) on the incoming samples x(n). Output of the block 56 at time t and at frequency $\omega_i$ is a frequency-domain representation $Y_t(\omega_i)$ of the incoming audio signal x(n), where $\omega_i$ is $(2\pi/T)i$, i is a frequency index and T is a length of a fetch which is used to compute the FFT. $Y_t(\omega_i)$ is provided to a spectrum accumulator 58. The spectrum accumulator 58 sums corresponding spectral components over a time window:

$$Y(\omega_i) = \sum_{t \,\in\, \text{window}} \left| Y_t(\omega_i) \right| \qquad (1)$$

where $|Y_t(\omega_i)|$ is an absolute value of the output of the FFT at a time t for a frequency $\omega_i = (2\pi/T)i \in [250, 2500]$ Hz. This frequency range is selected because it encompasses most of the energy of the speech signal. The time window starts when the power of the audio signal reaches WORD_THRESHOLD and stops when the audio signal's power drops below the WORD_THRESHOLD. Therefore, spectrum accumulator 58 averages over a complete duration of the "word" defined by the window which, for example, may correspond to a word such as "hello" or a DTMF tone. A switch 60 closes when the accumulation stops, that is, when the power drops below WORD_THRESHOLD. Accumulation at block 58 is a sum over time; thus output Y of the accumulator block 58 is an array independent of time and indexed in frequency by i:

$$Y = [\,\ldots,\ Y(\omega_i),\ \ldots,\ Y(\omega_{max})\,] \qquad (2)$$



where max is a maximum frequency index. 
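
The accumulation of Eqn. 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the frames are assumed to be equal-length fetches already known to lie inside the window, and a naive DFT from the standard library stands in for the FFT of block 56 (it yields the same magnitudes, only more slowly).

```python
import cmath
import math
from typing import List, Sequence, Tuple


def dft_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitude of each DFT bin from 0 up to Nyquist (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * m / n)
                for m, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def accumulate_spectrum(frames: Sequence[Sequence[float]],
                        sample_rate: float,
                        f_lo: float = 250.0,
                        f_hi: float = 2500.0) -> Tuple[List[float], List[float]]:
    """Sum |Y_t(w_i)| over all frames t, keeping only bins inside [f_lo, f_hi] Hz.

    Returns (bin frequencies in Hz, accumulated magnitudes), i.e. the array Y.
    """
    n = len(frames[0])
    bins = [k for k in range(n // 2 + 1) if f_lo <= k * sample_rate / n <= f_hi]
    sums = [0.0] * len(bins)
    for frame in frames:
        mags = dft_magnitudes(frame)
        for j, k in enumerate(bins):
            sums[j] += mags[k]
    freqs = [k * sample_rate / n for k in bins]
    return freqs, sums
```

With 10 ms fetches at an assumed 8000 Hz sample rate, each frame holds 80 samples and the bins are 100 Hz apart, so the retained bins run from 300 Hz to 2500 Hz.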

When the switch 60 closes, output of spectrum accumulator 58 is provided to feature extraction blocks 62, 64, 66 which calculate values based on elements in the array Y. A first block 62 calculates feature L1, a ratio of a sum of lower-frequency spectrum components to a sum of higher-frequency spectrum components in Eqn. 2:

$$L1 = \frac{\sum_{\omega_i \in [250,\, 680]\ \text{Hz}} Y(\omega_i)}{\sum_{\omega_i \in [750,\, 2500]\ \text{Hz}} Y(\omega_i)} \qquad (3)$$

If the audio signal has a frequency spectrum that spans the
range [250, 2500] Hz of frequencies, then L1 would be on
the order of 1. 
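
A minimal sketch of the band ratio of Eqn. 3, assuming the accumulated array is available as parallel lists of bin frequencies and values. The function names are hypothetical, and treating both band edges as inclusive is an assumption made for the example.

```python
from typing import Sequence


def band_sum(freqs: Sequence[float], values: Sequence[float],
             lo: float, hi: float) -> float:
    """Sum the array elements whose bin frequency lies in [lo, hi] Hz (inclusive)."""
    return sum(v for f, v in zip(freqs, values) if lo <= f <= hi)


def feature_l1(freqs: Sequence[float], values: Sequence[float]) -> float:
    """Ratio of the lower-band sum (250-680 Hz) to the higher-band sum (750-2500 Hz)."""
    low = band_sum(freqs, values, 250.0, 680.0)
    high = band_sum(freqs, values, 750.0, 2500.0)
    # An empty higher band would make the ratio undefined; report infinity instead.
    return low / high if high > 0 else float("inf")
```

A bin that falls between the two bands (for example 700 Hz) contributes to neither sum.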

A second block 64 calculates feature L2, a ratio of a
maximum value (MAX) of the lower-frequency elements in
the array to a sum of all other lower-frequency elements
in the array:
$$L2 = \frac{\mathrm{MAX}_{[250,\, 680]\ \text{Hz}}}{\sum_{\omega_i \in [250,\, 680]\ \text{Hz}} Y(\omega_i) - \mathrm{MAX}_{[250,\, 680]\ \text{Hz}}} \qquad (4)$$

L2 is a measure of a lower-frequency spectrum shape in the audio signal. For example, if the audio signal were a tone with a single frequency component of 480 Hz, then L2 would be relatively large since the maximum value (MAX) would be the value of Y at a frequency of 480 Hz and all other frequency components would be much smaller than the maximum value. If, on the other hand, the audio signal corresponded to noise, then L2 would be relatively small since the maximum value (MAX) is about the same size as all other frequency components in that range.

A third block 66 calculates feature L3, a duration T of the word:

$$L3 = T \qquad (5)$$

L3 is a measure of the length of the word.

L1, L2, and L3 are used as input values for corresponding fuzzy set blocks A 68, B 70, and C 72. Each fuzzy set block output $f_i(L)$, where $i \in [A, B, C]$ and $L \in [L1, L2, L3]$, represents a degree of membership in the fuzzy set for a particular value of the input feature L. The degree of membership $f_i(L)$ is a value (ranging from 0 to 1) of a membership function $f_i$ at point L. Degree of membership $f_i(L)$ shows how much the value of the feature (L) is compatible with the proposition that the input signal 16 represents human speech. FIG. 5 shows an example of a generalized membership function f 80 as a function of the feature L given in arbitrary units. For the value of L at point 82, the fuzzy set outputs a value of 0.0 which indicates that the input signal 16 does not represent human speech. Similarly, for the value of L at point 84, the fuzzy set outputs a value of 0.16 which indicates that the input signal 16 almost assuredly does not represent human speech. In contrast, for the value of L at point 86, the fuzzy set outputs a value of 1.0 which indicates that the input signal 16 represents human speech.

Before operation of the voice detector 50, the membership functions $f_i(L)$ are determined from a statistical analysis of typical audio signals that occur on telephone lines. For example, to determine the membership function $f_C(L)$, audio signal word lengths are measured repeatedly to build a statistical histogram of lengths which serves as the basis for the membership function $f_C(L)$. A shape of the membership function may be changed depending on a calling location or telephone line since tones used in telephone signals and speech patterns vary widely throughout the world.

Referring again to FIG. 4, the degrees of membership $f_A(L1)$, $f_B(L2)$, and $f_C(L3)$ are combined at junction 74 using a fuzzy additive technique. For example, the fuzzy additive technique may calculate an average F(A,B,C) of the individual degrees of membership:

$$F(A, B, C) = \frac{f_A(L1) + f_B(L2) + f_C(L3)}{3} \qquad (6)$$

Using Eqn. 6, if $f_A(L1) = 0.93$, $f_B(L2) = 0.99$, and $f_C(L3) = 0.87$, then F(A,B,C) = 0.93. Furthermore, junction 74 may be configured to take a weighted average $F(W_A A, W_B B, W_C C)$ if certain features L are more important to voice detection than others.

Output F(A,B,C) of junction 74 represents a final fuzzy set 76 and is used for defuzzification. Defuzzification converts the final fuzzy set 76 into a classical boolean set, that is, {0,1}. The value of F, which ranges from 0 to 1, is compared to a predetermined defuzzification threshold D. If F is less than or equal to D, then defuzzification converts F to a 0. If F is greater than D, then defuzzification converts F to a 1. The voice detector 50 generates a report 78 of the value F. A value of 1 indicates a presence of a voice in the audio signal and a value of 0 indicates voice rejection. For example, if D is set to 0.97, and F is 0.93 (as above), then F is converted to 0 and no voice is detected. The value of D may be adjusted depending on calling location, telephone line, or membership functions.

FIG. 6 shows a flowchart for a voice detection procedure 100 of FIG. 4. The voice detector 50 waits for the incoming sampled signal 16 (step 102). Then, the word boundary detector 54 determines if the power of the signal is greater than the WORD_THRESHOLD (step 104). If the power is not greater than the WORD_THRESHOLD, then the procedure advances to step 102 where the voice detector 50 waits for the sampled signal 16.

If, at step 104, the power is greater than the WORD_THRESHOLD, then the spectrum accumulator 58 accumulates the frequency spectrum components (output by block 56) of the incoming signal (step 106). At step 108, the word boundary detector 54 determines whether the power of the signal 16 remains above WORD_THRESHOLD. If the power remains above WORD_THRESHOLD, the procedure advances to step 104 where the spectrum accumulator 58 accumulates the frequency spectrum components. If, at step 108, the power falls below WORD_THRESHOLD, then the switch 60 closes and blocks 62, 64, 66 extract features L1, L2, and L3, respectively (step 110). The procedure 100 advances to step 112 where fuzzy set blocks A 68, B 70, and C 72 and junction 74 perform fuzzy analysis to determine if the signal corresponds to a voice. The voice detector 50 generates a report based on the output of junction 74 (step 114).

The systems and techniques described here may be used in any DSP application in which detection of a voice in an audio signal is required, for example, in any telephony or computer telephony application. In computer telephony applications, detection of a voice in an audio signal requires a statistical analysis that includes computer audio signals in addition to traditional telephone audio signals.

These systems and techniques may be implemented in digital electronic circuitry, or in computer hardware, software, or in various combinations thereof. Apparatus embodying these techniques may include appropriate input and output devices, a computer processor, and a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor.

A process embodying these techniques may be performed by a programmable processor executing a program of instructions to perform desired functions by operating on data and generating appropriate output. The techniques may be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device.

Each computer program may be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language may be compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only
memory and/or a random access memory. Storage devices 
suitable for tangibly embodying computer program instruc- 
tions and data include all forms of non-volatile memory, 
including by way of example semiconductor memory 
devices, such as EPROM, EEPROM, and flash memory 
devices; magnetic disks such as internal hard disks and 
removable disks; magneto-optical disks; and CD-ROM 
disks. Any of the foregoing may be supplemented by, or 
incorporated in, specially-designed ASICs (application- 
specific integrated circuits). 
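
As one illustration of such a program, the control flow of procedure 100 (FIG. 6) together with the peak ratio of Eqn. 4 might be sketched as below. This is a hedged sketch, not the patented implementation: the names `words` and `feature_l2` are hypothetical, each input frame is assumed to arrive as a pair of an already estimated power and a list of spectral magnitudes, and the word length is counted in frames rather than seconds.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# Assumed input shape: (estimated power, |FFT| magnitude per frequency bin).
Frame = Tuple[float, Sequence[float]]


def words(frames: Iterable[Frame],
          word_threshold: float) -> Iterator[Tuple[List[float], int]]:
    """Yield (accumulated spectrum, duration in frames) for each completed word."""
    acc: List[float] = []
    count = 0
    for power, mags in frames:
        if power > word_threshold:          # steps 104/108: power above threshold
            if count == 0:                  # a new window opens
                acc = [0.0] * len(mags)
            for i, m in enumerate(mags):    # step 106: accumulate the spectrum
                acc[i] += m
            count += 1
        elif count > 0:                     # power dropped: the window closes
            yield acc, count                # ready for feature extraction (step 110)
            acc, count = [], 0
    # A word still open when the input ends is not reported in this sketch.


def feature_l2(low_band: Sequence[float]) -> float:
    """Peak-to-rest ratio over the lower-frequency elements, as in Eqn. 4."""
    peak = max(low_band)
    rest = sum(low_band) - peak
    return peak / rest if rest > 0 else float("inf")
```

Each yielded pair carries what the later steps need: the array for the frequency-based features and the frame count for the word length. A tone-like array such as `[2, 18, 2]` gives a large `feature_l2`, while a flat, noise-like array such as `[2, 2, 2]` gives a small one.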

Other embodiments are within the scope of the following
claims.

What is claimed is:

1. A method of detecting a presence of a voice in an audio
signal, the method comprising:

sampling frequency components of the audio signal during
a window that starts when a power of the audio
signal reaches a predetermined threshold and stops
when the audio signal's power drops below the
predetermined threshold;

generating an array of elements based on the sampled
frequency components, each element of the array
corresponding to a time-based sum of frequency
components; and

determining whether the audio signal corresponds to a 
voice based on one or more values calculated from the 
generated array, each value corresponding either to a 
frequency-based sum of array elements or to the win- 
dow. 

2. The method of claim 1, in which a value corresponding 
to a frequency-based sum of array elements is a ratio of a 
frequency-based sum of array elements in a lower frequency 
range and a frequency-based sum of array elements in a 
higher frequency range. 

3. The method of claim 1, in which a value corresponding 
to a frequency-based sum of array elements is a ratio of a 
maximum-value array element in a lower frequency range 
and a frequency-based sum of array elements in the lower 
frequency range other than the maximum-value element. 

4. The method of claim 1, further comprising, prior to 
sampling, estimating the power of the audio signal. 

5. The method of claim 1, in which determining comprises 
analyzing the calculated values using fuzzy logic. 

6. The method of claim 5, in which analyzing comprises 
generating a degree of membership in a fuzzy set for each 
value. 

7. The method of claim 6, in which the degree of 
membership represents a measure of a likelihood that the 
audio signal is a voice. 

8. The method of claim 7, in which the degree of 
membership is based on a statistical analysis of audio 
signals. 

9. The method of claim 7, in which analyzing comprises 
combining the degrees of membership for each value into a 
final value and converting the final value into a voice 
detection decision. 

10. The method of claim 9, in which converting the final 
value comprises comparing the final value to a predeter- 
mined threshold. 

11. The method of claim 1, in which the audio signal 
occurs on a telephone line. 

12. The method of claim 1, in which the audio signal 
occurs in a computer telephony line. 

13. A method of detecting a presence of a voice in an 
audio signal, the method comprising: 

generating an array of elements in which each element of 
the array corresponds to a time-based sum of frequency 
components of the audio signal; 
calculating one or more values from the generated array;
and 

analyzing the calculated values using fuzzy logic to 
determine whether a voice is present in the audio 
signal; 

in which at least one of the one or more values is a 
window of time that starts when a power of the audio 
signal reaches a predetermined threshold and stops 
when the audio signal's power drops below the prede- 
termined threshold. 

14. The method of claim 13, in which analyzing com- 
prises generating a degree of membership in a fuzzy set for 
each value. 

15. The method of claim 14, in which the degree of 
membership represents a measure of a likelihood that the 
audio signal is a voice. 

16. The method of claim 15, in which the degree of 
membership is based on a statistical analysis of audio 
signals. 

17. The method of claim 15, in which analyzing com- 
prises combining the degrees of membership for each value 
into a final value and converting the final value into a voice 
detection decision. 

18. The method of claim 17, in which converting the final 
value comprises comparing the final value to a predeter- 
mined threshold. 

19. The method of claim 13, in which the audio signal 
occurs on a telephone line. 

20. The method of claim 13, in which the audio signal 
occurs on a computer telephony line. 

21. A method of detecting a presence of a voice in an 
audio signal, the method comprising: 

generating an array of elements in which each element of 
the array corresponds to a time-based sum of frequency 
components of the audio signal; 
calculating one or more values from the generated array; 
and 

analyzing the calculated values using fuzzy logic to 
determine whether a voice is present in the audio
signal; 

in which at least one of the one or more values is a ratio 
of a frequency-based sum of array elements in a lower 
frequency range and a frequency-based sum of array 
elements in a higher frequency range.

22. The method of claim 21, in which analyzing comprises
generating a degree of membership in a fuzzy set for
each value.

23. The method of claim 22, in which the degree of
membership represents a measure of a likelihood that the
audio signal is a voice.

24. The method of claim 23, in which the degree of 
membership is based on a statistical analysis of audio 
signals. 

25. The method of claim 23, in which analyzing comprises
combining the degrees of membership for each value
into a final value and converting the final value into a voice 
detection decision. 

26. The method of claim 25, in which converting the final
value comprises comparing the final value to a
predetermined threshold.

27. The method of claim 21, in which the audio signal 
occurs on a telephone line. 

28. The method of claim 21, in which the audio signal 
occurs on a computer telephony line.

29. A method of detecting a presence of a voice in an 
audio signal, the method comprising: 
generating an array of elements in which each element of the array corresponds to a time-based sum of frequency components of the audio signal;

calculating one or more values from the generated array; and

analyzing the calculated values using fuzzy logic to determine whether a voice is present in the audio signal;

in which at least one of the one or more values is a ratio of a maximum-value array element in the lower frequency range and a frequency-based sum of array elements in the lower frequency range other than the maximum-value element.

30. The method of claim 29, in which analyzing comprises generating a degree of membership in a fuzzy set for each value.

31. The method of claim 30, in which the degree of membership represents a measure of a likelihood that the audio signal is a voice.

32. The method of claim 31, in which the degree of membership is based on a statistical analysis of audio signals.

33. The method of claim 31, in which analyzing comprises combining the degrees of membership for each value into a final value and converting the final value into a voice detection decision.

34. The method of claim 33, in which converting the final value comprises comparing the final value to a predetermined threshold.

35. The method of claim 29, in which the audio signal occurs on a telephone line.

36. The method of claim 29, in which the audio signal occurs on a computer telephony line.

37. A method of detecting a presence of a voice on an audio signal, the method comprising:

generating an array of elements in which each element of the array corresponds to a time-based sum of frequency components of the audio signal;

calculating two or more values from the generated array including a first value corresponding to a ratio of a frequency-based sum of array elements in a lower frequency range and a frequency-based sum of array elements in a higher frequency range, and second value corresponding to a ratio of a maximum-value array element in the lower frequency range and a frequency-based sum of array elements in the lower frequency range other than the maximum-value element; and

analyzing the calculated values to determine whether a voice is present in the audio signal.

38. The method of claim 37, in which a third value is a time window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the predetermined threshold.

39. The method of claim 37, in which analyzing comprises using fuzzy logic to determine a measure of a likelihood that the audio signal is a voice.

40. The method of claim 39, in which analyzing comprises a statistical analysis of audio signals.

41. A method of detecting a presence of a voice on an audio signal, the method comprising:

sampling frequency components of the audio signal during a window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the predetermined threshold;

generating an array of elements based on the sampled frequency components, each element of the array corresponding to a time-based sum of frequency components;

calculating two or more values from the generated array including a first value corresponding to a ratio of a frequency-based sum of array elements in a lower frequency range and a frequency-based sum of array elements in a higher frequency range, and another value corresponding to a ratio of a maximum-value array element in the lower frequency range and a frequency-based sum of array elements in the lower frequency range other than the maximum-value element; and

analyzing the calculated values and the window using fuzzy logic to determine whether a voice is present in the audio signal.

42. The method of claim 41, in which determining comprises analyzing the calculated values using fuzzy logic.

43. The method of claim 42, in which analyzing comprises generating a degree of membership in a fuzzy set for each value.

44. The method of claim 43, in which the degree of membership represents a measure of a likelihood that the audio signal is a voice.

45. The method of claim 44, in which the degree of membership is based on a statistical analysis of audio signals.

46. The method of claim 44, in which analyzing comprises combining the degrees of membership for each value into a final value and converting the final value into a voice detection decision.

47. The method of claim 46, in which converting the final value comprises comparing the final value to a predetermined threshold.

48. The method of claim 41, in which the audio signal occurs on a telephone line.

49. The method of claim 41, in which the audio signal occurs on a computer telephony line.

50. A voice detector which detects a presence of a voice in an audio signal, the detector comprising:

a word boundary detector that defines a window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the predetermined threshold;

a frequency transform that transforms, during the window, the audio signal into a sequence of frequency components in discrete time intervals;

a spectrum accumulator that calculates, during the window, a time-based sum of frequency components for each discrete frequency interval;

a parameter extractor that calculates one or more values, each value corresponding either to a frequency-based sum of an output of the spectrum accumulator or to the window; and

a decision element that determines whether the audio signal corresponds to a voice based on output of the parameter extractor.

51. The voice detector of claim 50, in which the decision element comprises, for each extracted value, a fuzzy set block that determines a measure of a likelihood that the audio signal is a voice.

52. The voice detector of claim 51, in which the decision element comprises a junction that combines the outputs of the fuzzy set blocks and compares this combination to a predetermined threshold.

53. Computer software, stored on a computer-readable medium, for a voice detection system, the software comprising instructions for causing a computer system to perform the following operations:

sample frequency components of the audio signal during a window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the predetermined threshold;

generate an array of elements based on the sampled frequency components, each element of the array corresponding to a time-based sum of frequency components; and

determine whether the audio signal corresponds to a voice based on one or more values calculated from the generated array, each value corresponding either to a frequency-based sum of array elements or to the window.

***** 